Introduction

Cécile Alduy and I have been working with a team of undergraduate RAs and with some members of the Stanford University Libraries staff to build a corpus of the public discourse of French presidential candidates in advance of the 2017 elections. In this notebook, I will describe the methods by which we have been converting web pages into corpora for analysis and will provide some sample Python code.

Collaborating with Nicholas Taylor, SUL's Web Archiving Service Manager, and Sarah Sussman, the curator of French and Italian Collections, we have identified several key websites and begun periodic crawls using Archive-It. One of the first challenges for any text-mining project that uses web archives as a source is, unsurprisingly, getting the correct text from each website. Although it's possible to simply extract all of the text from a web page, doing so pulls in a great deal of extraneous material (navigation menus, sidebars, footers, boilerplate) that we don't want in the corpus.
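To make the problem concrete, here is a small sketch of the difference; the filename and the CSS selector are hypothetical, since the real selectors are site-specific and come up later:

from bs4 import BeautifulSoup

# 'example_page.html' stands in for any page saved out of a crawl
with open('example_page.html', encoding='utf-8') as fh:
  soup = BeautifulSoup(fh, 'html.parser')

# Naive approach: every bit of text on the page,
# including menus, footers, and sidebars
everything = soup.get_text()

# Targeted approach: only the article body, selected with a site-specific
# CSS selector ('div.article-body' is made up for illustration)
article = [el.getText() for el in soup.select('div.article-body')]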

  • general-purpose solution
  • WARC structure
  • warcat (see the sketch after this list)
  • BeautifulSoup
  • post-processing
  • corpus building
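
Before any parsing can happen, the WARC files produced by the crawls have to be unpacked into individual HTML files. One way to do that is with warcat; a minimal sketch follows, where the WARC filename is a placeholder and the exact options may differ by version (python3 -m warcat --help lists them):

  python3 -m warcat extract crawl-2017-01.warc.gz --output-dir extracted/ --progress

With the individual files on disk, the function below uses BeautifulSoup to pull the relevant fields out of each page, guided by per-site CSS selectors loaded from a CSV:
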
import os
from bs4 import BeautifulSoup

# clean_string(), generate_unique_filename(), and get_original_url() are
# helper functions defined elsewhere in this notebook.

def extract_text(site_info, input_file, corpus_dir, word_count=0):
  '''
  Extract the actual text from the HTML.
  Write out a file with the text content.
  Return extracted metadata about the text.
  '''
  results = dict()
  try:
    soup = BeautifulSoup(open(input_file, encoding='utf-8'), 'html.parser')
  except UnicodeDecodeError:
    # Skip files that aren't valid UTF-8
    return None

  if soup is None:
    return None

  # Skip the page if the site has a filter and the page doesn't match it
  if len(site_info['filter']) and not len(soup.select(site_info['filter'])):
    return None

  # Each of these fields in the CSV holds a BeautifulSoup select() pattern
  for item in ['title', 'date', 'author', 'content']:
    results[item] = ''
    if not len(site_info[item]):
      continue
    contents = soup.select(site_info[item])
    if contents is not None and len(contents):
      # Assume only the first result is relevant;
      # BS4 returns a list of results even if only one is found
      results[item] = clean_string(contents[0].getText())

  results['word_count'] = len(results['content'].split())
  results['filename'] = generate_unique_filename(corpus_dir, site_info['name'], results)
  # Skip pages that have already been written out
  if os.path.isfile(results['filename']):
    return None

  # Save the original URL
  results['url'] = get_original_url(site_info, input_file)

  if len(results['title']) and results['word_count'] >= int(word_count):
    # Ensure the output directory exists
    if not os.path.isdir(os.path.dirname(results['filename'])):
      os.makedirs(os.path.dirname(results['filename']))
    with open(results['filename'], 'w', encoding='utf-8') as content:
      content.write(results['content'])
    return results
  return None
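
For illustration, a call might look something like this; every value below is a placeholder, since the real selectors and paths come from the per-site CSV, and the real rows may carry additional fields used by the helper functions:

site_info = {
  'name': 'example-site',
  'filter': 'article.communique',   # pages without this element are skipped
  'title': 'h1.entry-title',
  'date': 'time.published',
  'author': 'span.author',
  'content': 'div.entry-content',
}

metadata = extract_text(site_info, 'extracted/example_page.html', 'corpus/', word_count=50)
if metadata:
  print(metadata['filename'], metadata['word_count'], metadata['url'])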
